Goto

Collaborating Authors

 safety layer


DSCD: Large Language Model Detoxification with Self-Constrained Decoding

arXiv.org Artificial Intelligence

Detoxification in large language models (LLMs) remains a significant research challenge. Existing decoding detoxification methods are all based on external constraints, which require additional resource overhead and lose generation fluency. This work proposes Detoxification with Self-Constrained Decoding (DSCD), a novel method for LLM detoxification without parameter fine-tuning. DSCD strengthens the inner next-token distribution of the safety layer while weakening that of hallucination and toxic layers during output generation. This effectively diminishes toxicity and enhances output safety. DSCD offers lightweight, high compatibility, and plug-and-play capabilities, readily integrating with existing detoxification methods for further performance improvement. Extensive experiments on representative open-source LLMs and public datasets validate DSCD's effectiveness, demonstrating state-of-the-art (SOTA) performance in both detoxification and generation fluency, with superior efficiency compared to existing methods. These results highlight DSCD's potential as a practical and scalable solution for safer LLM deployments.


Tail-Safe Hedging: Explainable Risk-Sensitive Reinforcement Learning with a White-Box CBF--QP Safety Layer in Arbitrage-Free Markets

arXiv.org Artificial Intelligence

We introduce Tail-Safe, a deployability-oriented framework for derivatives hedging that unifies distributional, risk-sensitive reinforcement learning with a white-box control-barrier-function (CBF) quadratic-program (QP) safety layer tailored to financial constraints. The learning component combines an IQN-based distributional critic with a CVaR objective (IQN--CVaR--PPO) and a Tail-Coverage Controller that regulates quantile sampling through temperature tilting and tail boosting to stabilize small-$α$ estimation. The safety component enforces discrete-time CBF inequalities together with domain-specific constraints -- ellipsoidal no-trade bands, box and rate limits, and a sign-consistency gate -- solved as a convex QP whose telemetry (active sets, tightness, rate utilization, gate scores, slack, and solver status) forms an auditable trail for governance. We provide guarantees of robust forward invariance of the safe set under bounded model mismatch, a minimal-deviation projection interpretation of the QP, a KL-to-DRO upper bound linking per-state KL regularization to worst-case CVaR, concentration and sample-complexity results for the temperature-tilted CVaR estimator, and a CVaR trust-region improvement inequality under KL limits, together with feasibility persistence under expiry-aware tightening. Empirically, in arbitrage-free, microstructure-aware synthetic markets (SSVI $\to$ Dupire $\to$ VIX with ABIDES/MockLOB execution), Tail-Safe improves left-tail risk without degrading central performance and yields zero hard-constraint violations whenever the QP is feasible with zero slack. Telemetry is mapped to governance dashboards and incident workflows to support explainability and auditability. Limitations include reliance on synthetic data and simplified execution to isolate methodological contributions.


Vision-driven River Following of UAV via Safe Reinforcement Learning using Semantic Dynamics Model

arXiv.org Artificial Intelligence

Vision-driven autonomous river following by Unmanned Aerial Vehicles is critical for applications such as rescue, surveillance, and environmental monitoring, particularly in dense riverine environments where GPS signals are unreliable. These safety-critical navigation tasks must satisfy hard safety constraints while optimizing performance. Moreover, the reward in river following is inherently history-dependent (non-Markovian) by which river segment has already been visited, making it challenging for standard safe Reinforcement Learning (SafeRL). To address these gaps, we propose three contributions. First, we introduce Marginal Gain Advantage Estimation, which refines the reward advantage function by using a sliding window baseline computed from historical episodic returns, aligning the advantage estimate with non-Markovian dynamics. Second, we develop a Semantic Dynamics Model based on patchified water semantic masks offering more interpretable and data-efficient short-term prediction of future observations compared to latent vision dynamics models. Third, we present the Constrained Actor Dynamics Estimator architecture, which integrates the actor, cost estimator, and SDM for cost advantage estimation to form a model-based SafeRL framework. Simulation results demonstrate that MGAE achieves faster convergence and superior performance over traditional critic-based methods like Generalized Advantage Estimation. SDM provides more accurate short-term state predictions that enable the cost estimator to better predict potential violations. Overall, CADE effectively integrates safety regulation into model-based RL, with the Lagrangian approach providing a "soft" balance between reward and safety during training, while the safety layer enhances inference by imposing a "hard" action overlay.


Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks

arXiv.org Artificial Intelligence

Fine-tuning as service injects domain-specific knowledge into large language models (LLMs), while challenging the original alignment mechanisms and introducing safety risks. A series of defense strategies have been proposed for the alignment, fine-tuning, and post-fine-tuning phases, where most post-fine-tuning defenses rely on coarse-grained safety layer mapping. These methods lack a comprehensive consideration of both safety layers and fine-grained neurons, limiting their ability to efficiently balance safety and utility. To address this, we propose the Fine-Grained Safety Neurons (FGSN) with Training-Free Continual Projection method to reduce the fine-tuning safety risks. FGSN inherently integrates the multi-scale interactions between safety layers and neurons, localizing sparser and more precise fine-grained safety neurons while minimizing interference with downstream task neurons. We then project the safety neuron parameters onto safety directions, improving model safety while aligning more closely with human preferences. Extensive experiments across multiple fine-tuned LLM models demonstrate that our method significantly reduce harmfulness scores and attack success rates with minimal parameter modifications, while preserving the model's utility. Furthermore, by introducing a task-specific, multi-dimensional heterogeneous safety neuron cluster optimization mechanism, we achieve continual defense and generalization capability against unforeseen emerging safety concerns.


Synthesis of Deep Neural Networks with Safe Robust Adaptive Control for Reliable Operation of Wheeled Mobile Robots

arXiv.org Artificial Intelligence

Deep neural networks (DNNs) can enable precise control while maintaining low computational costs by circumventing the need for dynamic modeling. However, the deployment of such black-box approaches remains challenging for heavy-duty wheeled mobile robots (WMRs), which are subject to strict international standards and prone to faults and disturbances. We designed a hierarchical control policy for heavy-duty WMRs, monitored by two safety layers with differing levels of authority. To this end, a DNN policy was trained and deployed as the primary control strategy, providing high-precision performance under nominal operating conditions. When external disturbances arise and reach a level of intensity such that the system performance falls below a predefined threshold, a low-level safety layer intervenes by deactivating the primary control policy and activating a model-free robust adaptive control (RAC) policy. This transition enables the system to continue operating while ensuring stability by effectively managing the inherent trade-off between system robustness and responsiveness. Regardless of the control policy in use, a high-level safety layer continuously monitors system performance during operation. It initiates a shutdown only when disturbances become sufficiently severe such that compensation is no longer viable and continued operation would jeopardize the system or its environment. The proposed synthesis of DNN and RAC policy guarantees uniform exponential stability of the entire WMR system while adhering to safety standards to some extent. The effectiveness of the proposed approach was further validated through real-time experiments using a 6,000 kg WMR.


Self-Aware Safety Augmentation: Leveraging Internal Semantic Understanding to Enhance Safety in Vision-Language Models

arXiv.org Artificial Intelligence

Large vision-language models (LVLMs) are vulnerable to harmful input compared to their language-only backbones. We investigated this vulnerability by exploring LVLMs internal dynamics, framing their inherent safety understanding in terms of three key capabilities. Specifically, we define these capabilities as safety perception, semantic understanding, and alignment for linguistic expression, and experimentally pinpointed their primary locations within the model architecture. The results indicate that safety perception often emerges before comprehensive semantic understanding, leading to the reduction in safety. Motivated by these findings, we propose \textbf{Self-Aware Safety Augmentation (SASA)}, a technique that projects informative semantic representations from intermediate layers onto earlier safety-oriented layers. This approach leverages the model's inherent semantic understanding to enhance safety recognition without fine-tuning. Then, we employ linear probing to articulate the model's internal semantic comprehension to detect the risk before the generation process. Extensive experiments on various datasets and tasks demonstrate that SASA significantly improves the safety of LVLMs, with minimal impact on the utility.


Security Tensors as a Cross-Modal Bridge: Extending Text-Aligned Safety to Vision in LVLM

arXiv.org Artificial Intelligence

Large visual-language models (LVLMs) integrate aligned large language models (LLMs) with visual modules to process multimodal inputs. However, the safety mechanisms developed for text-based LLMs do not naturally extend to visual modalities, leaving LVLMs vulnerable to harmful image inputs. To address this cross-modal safety gap, we introduce security tensors - trainable input vectors applied during inference through either the textual or visual modality. These tensors transfer textual safety alignment to visual processing without modifying the model's parameters. They are optimized using a curated dataset containing (i) malicious image-text pairs requiring rejection, (ii) contrastive benign pairs with text structurally similar to malicious queries, with the purpose of being contrastive examples to guide visual reliance, and (iii) general benign samples preserving model functionality. Experimental results demonstrate that both textual and visual security tensors significantly enhance LVLMs' ability to reject diverse harmful visual inputs while maintaining near-identical performance on benign tasks. Further internal analysis towards hidden-layer representations reveals that security tensors successfully activate the language module's textual "safety layers" in visual inputs, thereby effectively extending text-based safety to the visual modality.


Towards Safe Robot Foundation Models Using Inductive Biases

arXiv.org Artificial Intelligence

Safety is a critical requirement for the real-world deployment of robotic systems. Unfortunately, while current robot foundation models show promising generalization capabilities across a wide variety of tasks, they fail to address safety, an important aspect for ensuring long-term operation. Current robot foundation models assume that safe behavior should emerge by learning from a sufficiently large dataset of demonstrations. However, this approach has two clear major drawbacks. Firstly, there are no formal safety guarantees for a behavior cloning policy trained using supervised learning. Secondly, without explicit knowledge of any safety constraints, the policy may require an unreasonable number of additional demonstrations to even approximate the desired constrained behavior. To solve these key issues, we show how we can instead combine robot foundation models with geometric inductive biases using ATACOM, a safety layer placed after the foundation policy that ensures safe state transitions by enforcing action constraints. With this approach, we can ensure formal safety guarantees for generalist policies without providing extensive demonstrations of safe behavior, and without requiring any specific fine-tuning for safety. Our experiments show that our approach can be beneficial both for classical manipulation tasks, where we avoid unwanted collisions with irrelevant objects, and for dynamic tasks, such as the robot air hockey environment, where we can generate fast trajectories respecting complex tasks and joint space constraints.


Multi-Constraint Safe Reinforcement Learning via Closed-form Solution for Log-Sum-Exp Approximation of Control Barrier Functions

arXiv.org Artificial Intelligence

The safety of training task policies and their subsequent application using reinforcement learning (RL) methods has become a focal point in the field of safe RL. A central challenge in this area remains the establishment of theoretical guarantees for safety during both the learning and deployment processes. Given the successful implementation of Control Barrier Function (CBF)-based safety strategies in a range of control-affine robotic systems, CBF-based safe RL demonstrates significant promise for practical applications in real-world scenarios. However, integrating these two approaches presents several challenges. First, embedding safety optimization within the RL training pipeline requires that the optimization outputs be differentiable with respect to the input parameters, a condition commonly referred to as differentiable optimization, which is non-trivial to solve. Second, the differentiable optimization framework confronts significant efficiency issues, especially when dealing with multi-constraint problems. To address these challenges, this paper presents a CBF-based safe RL architecture that effectively mitigates the issues outlined above. The proposed approach constructs a continuous AND logic approximation for the multiple constraints using a single composite CBF. By leveraging this approximation, a close-form solution of the quadratic programming is derived for the policy network in RL, thereby circumventing the need for differentiable optimization within the end-to-end safe RL pipeline. This strategy significantly reduces computational complexity because of the closed-form solution while maintaining safety guarantees. Simulation results demonstrate that, in comparison to existing approaches relying on differentiable optimization, the proposed method significantly reduces training computational costs while ensuring provable safety throughout the training process.


Towards Safe Robot Foundation Models

arXiv.org Artificial Intelligence

Robot foundation models hold the potential for deployment across diverse environments, from industrial applications to household tasks. While current research focuses primarily on the policies' generalization capabilities across a variety of tasks, it fails to address safety, a critical requirement for deployment on real-world systems. In this paper, we introduce a safety layer designed to constrain the action space of any generalist policy appropriately. Our approach uses ATACOM, a safe reinforcement learning algorithm that creates a safe action space and, therefore, ensures safe state transitions. By extending ATACOM to generalist policies, our method facilitates their deployment in safety-critical scenarios without requiring any specific safety fine-tuning. We demonstrate the effectiveness of this safety layer in an air hockey environment, where it prevents a puck-hitting agent from colliding with its surroundings, a failure observed in generalist policies.